STATS 32 Session 5: Data Transformation

Kenneth Tay

Oct 8, 2019

Recap of week 2: data visualization

Different types of plots for visualizing different types of variables
Grammar of graphics for ggplot2
- Essential elements: data, geometries, aesthetics
- Layers allow us to build more complex plots
- Scales, facets, themes and non-data ink

`ggplot2` syntax

library(ggplot2)
ggplot()

`ggplot2` syntax

ggplot() +
    geom_violin(data = mtcars, 
                mapping = aes(x = factor(cyl), y = hp))

`ggplot2` syntax

ggplot() +
    geom_violin(data = mtcars, 
                mapping = aes(x = factor(cyl), y = hp)) +
    geom_jitter(data = mtcars, 
               mapping = aes(x = factor(cyl), y = hp))

`ggplot2` syntax

ggplot(data = mtcars, 
       mapping = aes(x = factor(cyl), y = hp)) +
    geom_violin() +
    geom_jitter()

`ggplot2` syntax

ggplot(data = mtcars, 
       mapping = aes(x = factor(cyl), y = hp)) +
    geom_violin() +
    geom_jitter() +
    labs(title = "Horsepower vs. Cylinder", x = "Cylinder", 
         y = "Horsepower")

`ggplot2` syntax

ggplot(data = mtcars, 
       mapping = aes(x = factor(cyl), y = hp)) +
    geom_violin() +
    geom_jitter() +
    labs(title = "Horsepower vs. Cylinder", x = "Cylinder", 
         y = "Horsepower") +
    theme_classic()

Agenda for today

Transforming data with dplyr (and %>% syntax)

Why do we need to transform data?

We rarely get data in exactly the form we need!

Need to create new variables or summaries
Need to rename or reorder variables

Transforming data in R is made easy by the dplyr package (“official” cheat sheet available here).

The 5 basic `dplyr` verbs

select(): pick variables by their names
mutate(): create new variables based on existing ones
arrange(): reorder rows
filter(): pick observations by their values
summarize(): collapse many values down to a single summary

Toy example: Student scores

scores

##     Name Gender English Math Science History Spanish
## 1 Andrew      M      60   96      80      56      77
## 2   John      M      66   55      56      64      77
## 3   Mary      F      92   63      70      62      98
## 4   Jane      F      80   76      89      55      40
## 5    Bob      M      80   80      82      48      50
## 6    Dan      M      58   52      79      90      61

`select`: pick subset of variables/columns by name

History teacher: “I just want their names and History scores”

`select`: pick subset of variables/columns by name

History teacher: “I just want their names and History scores”

Take the scores dataset.
Select the Name and History columns.

`select`: pick subset of variables/columns by name

History teacher: “I just want their names and History scores”

Take the scores dataset.
Select the Name and History columns.

scores %>%
    select(Name, History)

##     Name History
## 1 Andrew      56
## 2   John      64
## 3   Mary      62
## 4   Jane      55
## 5    Bob      48
## 6    Dan      90

`mutate`: create new columns based on old ones

Form teacher: “What are their total scores?”

`mutate`: create new columns based on old ones

Form teacher: “What are their total scores?”

Take the scores dataset.
Add a new column by mutating existing columns: for each row, Total = English + Math + Science + History + Spanish

`mutate`: create new columns based on old ones

Form teacher: “What are their total scores?”

Take the scores dataset.
Add a new column by mutating existing columns: for each row, Total = English + Math + Science + History + Spanish

scores <- scores %>%
    mutate(Total = English + Math + Science + History + Spanish)
scores

##     Name Gender English Math Science History Spanish Total
## 1 Andrew      M      60   96      80      56      77   369
## 2   John      M      66   55      56      64      77   318
## 3   Mary      F      92   63      70      62      98   385
## 4   Jane      F      80   76      89      55      40   340
## 5    Bob      M      80   80      82      48      50   340
## 6    Dan      M      58   52      79      90      61   340

`arrange`: reorder rows

Form teacher: “Can I have the students in order of overall performance?”

`arrange`: reorder rows

Form teacher: “Can I have the students in order of overall performance?”

Take the scores dataset.
Arrange rows by the Total column.

`arrange`: reorder rows

Form teacher: “Can I have the students in order of overall performance?”

Take the scores dataset.
Arrange rows by the Total column.

scores %>%
    arrange(Total)

##     Name Gender English Math Science History Spanish Total
## 1   John      M      66   55      56      64      77   318
## 2   Jane      F      80   76      89      55      40   340
## 3    Bob      M      80   80      82      48      50   340
## 4    Dan      M      58   52      79      90      61   340
## 5 Andrew      M      60   96      80      56      77   369
## 6   Mary      F      92   63      70      62      98   385

`arrange`: reorder rows

Form teacher: “No no, better students on top please…”

`arrange`: reorder rows

Form teacher: “No no, better students on top please…”

Take the scores dataset.
Arrange rows by the Total column, but in descending order.

`arrange`: reorder rows

Form teacher: “No no, better students on top please…”

Take the scores dataset.
Arrange rows by the Total column, but in descending order.

scores %>%
    arrange(desc(Total))

##     Name Gender English Math Science History Spanish Total
## 1   Mary      F      92   63      70      62      98   385
## 2 Andrew      M      60   96      80      56      77   369
## 3   Jane      F      80   76      89      55      40   340
## 4    Bob      M      80   80      82      48      50   340
## 5    Dan      M      58   52      79      90      61   340
## 6   John      M      66   55      56      64      77   318

`arrange`: reorder rows

Form teacher: “Can I have them in descending order of total scores, but if students tie, then by alphabetical order?”

`arrange`: reorder rows

Form teacher: “Can I have them in descending order of total scores, but if students tie, then by alphabetical order?”

Take the scores dataset.
Arrange rows first by the Total column in descending order, then by the Name column (in ascending order).

`arrange`: reorder rows

Form teacher: “Can I have them in descending order of total scores, but if students tie, then by alphabetical order?”

Take the scores dataset.
Arrange rows first by the Total column in descending order, then by the Name column (in ascending order).

scores %>%
    arrange(desc(Total), Name)

##     Name Gender English Math Science History Spanish Total
## 1   Mary      F      92   63      70      62      98   385
## 2 Andrew      M      60   96      80      56      77   369
## 3    Bob      M      80   80      82      48      50   340
## 4    Dan      M      58   52      79      90      61   340
## 5   Jane      F      80   76      89      55      40   340
## 6   John      M      66   55      56      64      77   318

`filter`: pick observations by their values

History teacher: “I want to see which students scored less than 60 for history”

`filter`: pick observations by their values

History teacher: “I want to see which students scored less than 60 for history”

Take the scores dataset.
Filter for rows where the value in the History column is less than 60.

`filter`: pick observations by their values

History teacher: “I want to see which students scored less than 60 for history”

Take the scores dataset.
Filter for rows where the value in the History column is less than 60.

scores %>%
    filter(History < 60)

##     Name Gender English Math Science History Spanish Total
## 1 Andrew      M      60   96      80      56      77   369
## 2   Jane      F      80   76      89      55      40   340
## 3    Bob      M      80   80      82      48      50   340

More on filters

Other ways to make comparisons:

>: greater than
<: less than
>=: greater than or equal to
<=: less than or equal to
!=: not equal to
==: equal to (Do not use = to test for equality!!)

More on filters

Other ways to make comparisons:

>: greater than
<: less than
>=: greater than or equal to
<=: less than or equal to
!=: not equal to
==: equal to (Do not use = to test for equality!!)

Combining comparisons:

!: not
&: and
|: or

More `filter` examples

Dan’s parents: “I just want Dan’s scores”

More `filter` examples

Dan’s parents: “I just want Dan’s scores”

scores %>% 
    filter(Name == "Dan")

##   Name Gender English Math Science History Spanish Total
## 1  Dan      M      58   52      79      90      61   340

More `filter` examples

Dan’s parents: “I just want Dan’s scores”

scores %>% 
    filter(Name == "Dan")

##   Name Gender English Math Science History Spanish Total
## 1  Dan      M      58   52      79      90      61   340

Language teacher: “I want to know which students score < 50 for either English or Spanish”

More `filter` examples

Dan’s parents: “I just want Dan’s scores”

scores %>% 
    filter(Name == "Dan")

##   Name Gender English Math Science History Spanish Total
## 1  Dan      M      58   52      79      90      61   340

Language teacher: “I want to know which students score < 50 for either English or Spanish”

scores %>% 
    filter(English < 50 | Spanish < 50)

##   Name Gender English Math Science History Spanish Total
## 1 Jane      F      80   76      89      55      40   340

`summarize`: get summaries of data

Academic: “I want to know the correlation between math and science scores”

`summarize`: get summaries of data

Academic: “I want to know the correlation between math and science scores”

Take the scores dataset.
Summarize the dataset by taking the correlation of the Math and Science columns.

`summarize`: get summaries of data

Academic: “I want to know the correlation between math and science scores”

Take the scores dataset.
Summarize the dataset by taking the correlation of the Math and Science columns.

scores %>%
    summarize(corr = cor(Math, Science))

##        corr
## 1 0.5470561

`summarize`: get summaries of data

Science teacher: “I want to know the mean and standard deviation of the scores for science”

`summarize`: get summaries of data

Science teacher: “I want to know the mean and standard deviation of the scores for science”

Take the scores dataset.
Summarize the dataset by taking the mean of the Science column and the standard deviation of the science column.

`summarize`: get summaries of data

Science teacher: “I want to know the mean and standard deviation of the scores for science”

Take the scores dataset.
Summarize the dataset by taking the mean of the Science column and the standard deviation of the science column.

scores %>%
    summarize(Science_mean = mean(Science), 
              Science_sd = sd(Science))

##   Science_mean Science_sd
## 1           76   11.54123

Chaining multiple `dplyr` commands using `%>%`

Science teacher: “I want to know which students scored > 80 for Science, but I just want names”

Chaining multiple `dplyr` commands using `%>%`

Science teacher: “I want to know which students scored > 80 for Science, but I just want names”

Take the scores dataset.
Filter for the rows which have Science > 80.
Select the Name column.

Chaining multiple `dplyr` commands using `%>%`

Science teacher: “I want to know which students scored > 80 for Science, but I just want names”

Take the scores dataset.
Filter for the rows which have Science > 80.
Select the Name column.

scores %>%
    filter(Science > 80) %>%
    select(Name)

##   Name
## 1 Jane
## 2  Bob

`group_by`: use `dplyr` verbs on a group-by-group basis

Academic: “I want to know if the boys scored better than the girls in Spanish”

`group_by`: use `dplyr` verbs on a group-by-group basis

Academic: “I want to know if the boys scored better than the girls in Spanish”

Take the scores dataset.
Group the dataset rows by gender.
Summarize each group of the dataset by taking the mean of the Spanish column.

`group_by`: use `dplyr` verbs on a group-by-group basis

Academic: “I want to know if the boys scored better than the girls in Spanish”

Take the scores dataset.
Group the dataset rows by gender.
Summarize each group of the dataset by taking the mean of the Spanish column.

scores %>%
    group_by(Gender) %>%
    summarize(Spanish_mean = mean(Spanish))

## # A tibble: 2 x 2
##   Gender Spanish_mean
##   <chr>         <dbl>
## 1 F              69  
## 2 M              66.2

Practice 1

Language teacher: “I want to know which students scored < 70 for both Spanish, but I just want names”

Practice 1

Language teacher: “I want to know which students scored < 70 for both Spanish, but I just want names”

Take the scores dataset.
Filter for the rows which have Spanish < 70.

scores %>%
    filter(Spanish < 70)

##   Name Gender English Math Science History Spanish Total
## 1 Jane      F      80   76      89      55      40   340
## 2  Bob      M      80   80      82      48      50   340
## 3  Dan      M      58   52      79      90      61   340

Practice 2

Language teacher: “I want to know which students scored < 70 for both English and Spanish, but I just want names”

Practice 2

Language teacher: “I want to know which students scored < 70 for both English and Spanish, but I just want names”

Take the scores dataset.
Filter for the rows which have English < 70 and Spanish < 70.
Select the Name column.

scores %>%
    filter(English < 70 & Spanish < 70) %>%
    select(Name)

##   Name
## 1  Dan

Practice 3

Math teacher: “I want to know the lowest Math score for each Gender”

Practice 3

Math teacher: “I want to know the lowest Math score for each Gender”

Take the scores dataset.
Group the dataset rows by gender.
Summarize each group of the dataset by taking the minimum of the Math column.

scores %>%
    group_by(Gender) %>%
    summarize(min_math = min(Math))

## # A tibble: 2 x 2
##   Gender min_math
##   <chr>     <dbl>
## 1 F            63
## 2 M            52

Practice 4

History teacher: “I want the names of students with their history scores, with the entries sorted by name”

Practice 4

History teacher: “I want the names of students with their history scores, with the entries sorted by name”

Take the scores dataset.
Arrange the dataset rows by the name column.
Select the Name and History columns.

scores %>%
    arrange(Name) %>%
    select(Name, History)

##     Name History
## 1 Andrew      56
## 2    Bob      48
## 3    Dan      90
## 4   Jane      55
## 5   John      64
## 6   Mary      62

Today’s dataset: Flights from New York City

Exploring flights from NYC to the Bay Area

(Source: Sunday Express)

Optional material

`transmute`: create new columns based on old ones, discard old ones

Form teacher: “I just want the mean score for each student”

scores %>% 
    transmute(mean = (English + Math + Science + History + Spanish) / 5)

TRUE/FALSE statements

How does R understand the code filter(History < 60)?

R checks each row of the dataset: is History less than 60 or not?
- History < 60 is a statement that is either TRUE or FALSE
- If TRUE, keep the row
- If not, discard it
filter(<condition>) only returns the rows for which <condition> is TRUE
Technical term for a statement that is either TRUE or FALSE: boolean expression

TRUE/FALSE examples

3 > 2

## [1] TRUE

3 < 2

## [1] FALSE

3 == 2

## [1] FALSE

c(1, 2, 3, 1) == c(3, 2, 1, 2)

## [1] FALSE  TRUE FALSE FALSE

c(1, 2, 3, 1) == 1

## [1]  TRUE FALSE FALSE  TRUE

Be careful with `NA`s!

1 == NA

## [1] NA

NA == NA

## [1] NA

is.na(NA)

## [1] TRUE

More on `%>%`

%>% is implemented by the magrittr package
When the dplyr package is loaded, magrittr is loaded too
%>% is “syntactic sugar”: makes code easier to understand
Whatever is on the left of %>% becomes the first argument in the function on the right of %>%
E.g. head(mtcars, n = 6) is equivalent to mtcars %>% head(n = 6)

STATS 32 Session 5: Data Transformation

Recap of week 2: data visualization

ggplot2 syntax

ggplot2 syntax

ggplot2 syntax

ggplot2 syntax

ggplot2 syntax

ggplot2 syntax

Agenda for today

Why do we need to transform data?

The 5 basic dplyr verbs

Toy example: Student scores

select: pick subset of variables/columns by name

select: pick subset of variables/columns by name

select: pick subset of variables/columns by name

mutate: create new columns based on old ones

mutate: create new columns based on old ones

mutate: create new columns based on old ones

arrange: reorder rows

arrange: reorder rows

arrange: reorder rows

arrange: reorder rows

arrange: reorder rows

arrange: reorder rows

arrange: reorder rows

arrange: reorder rows

arrange: reorder rows

filter: pick observations by their values

filter: pick observations by their values

filter: pick observations by their values

More on filters

More on filters

More filter examples

More filter examples

More filter examples

More filter examples

summarize: get summaries of data

summarize: get summaries of data

summarize: get summaries of data

summarize: get summaries of data

summarize: get summaries of data

summarize: get summaries of data

Chaining multiple dplyr commands using %>%

Chaining multiple dplyr commands using %>%

Chaining multiple dplyr commands using %>%

group_by: use dplyr verbs on a group-by-group basis

group_by: use dplyr verbs on a group-by-group basis

group_by: use dplyr verbs on a group-by-group basis

Practice 1

Practice 1

Practice 2

Practice 2

Practice 3

Practice 3

Practice 4

Practice 4

Today’s dataset: Flights from New York City

transmute: create new columns based on old ones, discard old ones

TRUE/FALSE statements

TRUE/FALSE examples

Be careful with NAs!

More on %>%

`ggplot2` syntax

`ggplot2` syntax

`ggplot2` syntax

`ggplot2` syntax

`ggplot2` syntax

`ggplot2` syntax

The 5 basic `dplyr` verbs

`select`: pick subset of variables/columns by name

`select`: pick subset of variables/columns by name

`select`: pick subset of variables/columns by name

`mutate`: create new columns based on old ones

`mutate`: create new columns based on old ones

`mutate`: create new columns based on old ones

`arrange`: reorder rows

`arrange`: reorder rows

`arrange`: reorder rows

`arrange`: reorder rows

`arrange`: reorder rows

`arrange`: reorder rows

`arrange`: reorder rows

`arrange`: reorder rows

`arrange`: reorder rows

`filter`: pick observations by their values

`filter`: pick observations by their values

`filter`: pick observations by their values

More `filter` examples

More `filter` examples

More `filter` examples

More `filter` examples

`summarize`: get summaries of data

`summarize`: get summaries of data

`summarize`: get summaries of data

`summarize`: get summaries of data

`summarize`: get summaries of data

`summarize`: get summaries of data

Chaining multiple `dplyr` commands using `%>%`

Chaining multiple `dplyr` commands using `%>%`

Chaining multiple `dplyr` commands using `%>%`

`group_by`: use `dplyr` verbs on a group-by-group basis

`group_by`: use `dplyr` verbs on a group-by-group basis

`group_by`: use `dplyr` verbs on a group-by-group basis

`transmute`: create new columns based on old ones, discard old ones

Be careful with `NA`s!

More on `%>%`